PC-NORMIX (Version 2.1)
Copyright (C) John H. Wolfe, 1995. All rights reserved.
.......................................................................
A. Identification
1. Title= Cluster and pattern analysis of normal mixtures
2. Identification= G7 PC-NORMIX (Version 2.1)
3. Category= Multivariate Statistics
4. Compiler/Operating System= The source code has been tested on
four different computer/compiler/operating system combinations and
works on all four with little or no modification.
Computer:     Compiler:                Operating System:
-----------   ----------------------   -----------------
80286         Microsoft Fortran 5.0    MSDOS
80386/80486   Microsoft PowerStation   MSDOS Extended
VAX 11/780    f77                      unix
IBM 4381      FORTVS                   CMS
5. Date= January 1995
6. Programmer= John H. Wolfe
B. Purpose
This program is a tool for the user seeking maximum likelihood
estimates of the parameters of a mixture of multivariate normal
distributions. The program solves the equations for maximum likelihood,
using an iterative algorithm. Because mixture problems usually have
multiple relative maxima, the program will produce different results,
depending on the initial estimates supplied by the user. One should
run a variety of other clustering programs first, and then use their
results as initial estimates for NORMIX. This procedure has two primary
benefits:
1. It will evaluate different solutions produced by other clustering
programs by computing their likelihood.
2. Given a solution from another clustering program, it will proceed
to generate a "better" solution with greater likelihood.
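The two benefits above can be illustrated with a minimal sketch in Python (not the program's Fortran). It assumes a univariate mixture with a separate variance for each type; the function names and the data are hypothetical, and the update shown is the textbook EM step for normal mixtures, which is not necessarily the exact iteration NORMIX performs.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(data, props, means, sds):
    """Log likelihood of the data under a mixture of normals (benefit 1)."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, s) for p, m, s in zip(props, means, sds)))
        for x in data
    )

def em_step(data, props, means, sds):
    """One textbook EM update (benefit 2); returns new (props, means, sds)."""
    # Posterior probability of each type for each observation.
    post = []
    for x in data:
        w = [p * normal_pdf(x, m, s) for p, m, s in zip(props, means, sds)]
        total = sum(w)
        post.append([wi / total for wi in w])
    n = len(data)
    new_props, new_means, new_sds = [], [], []
    for j in range(len(props)):
        nj = sum(row[j] for row in post)
        mean_j = sum(row[j] * x for row, x in zip(post, data)) / nj
        var_j = sum(row[j] * (x - mean_j) ** 2 for row, x in zip(post, data)) / nj
        new_props.append(nj / n)
        new_means.append(mean_j)
        new_sds.append(math.sqrt(var_j))
    return new_props, new_means, new_sds

# Hypothetical data and a rough starting solution, such as another
# clustering program might supply.
data = [1.0, 1.2, 0.8, 1.1, 4.9, 5.2, 5.0, 4.7, 5.3]
start = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])

before = log_likelihood(data, *start)
improved = em_step(data, *start)
after = log_likelihood(data, *improved)
```

Comparing `before` with `after` shows whether the update raised the likelihood of the starting solution; repeating `em_step` until the change is negligible mimics iterating to convergence.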
If the user does not input his own initial estimates, a default
hierarchical grouping procedure will generate initial estimates
for the iterative algorithm. It is not recommended that the user
rely solely on this default option.
An option permits the user to specify whether the covariance
matrices within types will be the same or different. The unequal
covariance option requires larger sample sizes for reliable results,
in most cases. Also, the unequal covariance case has many
singularities in the likelihood function. Fortunately, Peters and
Walker (1978) showed that almost surely the likelihood function has
a unique maximum within a neighborhood of the population parameters
of the mixture for sufficiently large N.
This program estimates more parameters than clustering procedures
which assume a Euclidean distance. As in multiple regression, the more
parameters that are estimated, the larger the sample size should be.
If a common covariance matrix is assumed, a good sample size might
be 20 times the number of variables. If the clusters have different
covariance matrices, a good sample size might be 20 times the number
of clusters times the number of variables.
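The rule of thumb above is simple arithmetic; a short Python sketch makes it explicit. The function name and arguments are illustrative and are not part of NORMIX.

```python
def suggested_sample_size(n_variables, n_clusters, common_covariance=True):
    """Rule-of-thumb sample size from the text: 20 cases per variable,
    multiplied by the number of clusters when each cluster has its own
    covariance matrix."""
    cases_per_variable = 20
    if common_covariance:
        return cases_per_variable * n_variables
    return cases_per_variable * n_clusters * n_variables

# Example: 4 variables and 3 clusters.
common = suggested_sample_size(4, 3, common_covariance=True)
different = suggested_sample_size(4, 3, common_covariance=False)
```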
References.
Wolfe, John H. (1970).
Pattern clustering by multivariate mixture analysis.
Multivariate Behavioral Research, 5, 329-350.
Wolfe, John H. (1978).
Comparative cluster analysis of patterns of vocational interest.
Multivariate Behavioral Research, 13, 33-44.
Peters, B. C. & Walker, H. F. (1978). An iterative procedure for
obtaining maximum-likelihood estimates of the parameters for a
mixture of normal distributions. SIAM J. Appl. Math., 35, 362-378.
C. Usage
1. Storage requirements= variable, depending on the data.
The requirements are displayed on the screen, prior to execution of
the analysis. On the UNIX or CMS systems, storage may be increased
by recompiling the main program (NORMIX.FOR) with a larger dimension
for the A array and for the DATA statement that follows DIMENSION A().
This version uses extended memory, as needed, but large problems
could fail on computers with small memories.
2. Restrictions= the number of types must not exceed 20. There
are no fixed limits on the number of variables or sample size.
However, all data and arrays must fit into RAM.
3. Environment= The PC executable file PCNORMIX.EXE requires an 80386
or higher CPU.
4. Error messages= Bad initial estimates or diverging iterations
can produce invalid parameter estimates, such as negative
variances or singular correlation matrices, which in turn
result in system diagnostics.
5. Time= Observed run times for the four sample problems accompanying
this program are given below.
SAMPLE PROBLEM CHARACTERISTICS AND RUN TIMES
Characteristic          Irismap   Irismix   Artificial   SVIB
-------------------------------------------------------------
Equal Covariances       Yes       No        No           Yes
Sample Size             150       150       225          113
No. of Variables        4         4         2            22
No. of Types            3-4       3-4       1-4          13-15
Hypothesis:Iterations   3:27      3:13      2:21         13:3
                        4:19      4:29      3:27         14:2
                                            4:28         15:2
CPU Seconds
-------------------------------------------------------------
IBM 4381                7.3       8.4       12.4         28.5
VAX 11/780              55.6      56.1      86.0         188.8
80286 (8 MHz)           795.0     1136.0    1731.0       2860.0
80286+80287 (8 MHz)     157.0     204.0     297.0        517.0
80386 (33 MHz)          184.6     279.2     385.3        501.1
80486DX33               6.9       6.4       9.2          21.2
80486DX66               3.8       3.5       4.8          11.8
On the IBM 360/65, the CPU time in minutes was given by
the following two formulas:
Common covariance matrix option:
    Minutes = (1.0e-6)*( 1667*(2*t-7) + m*(2.865*n*v + 4.961*m)
                         + 4.25*t*v**2 + 2.12*t*v**3 + 67*n*t
                         + 1.23*n*i*t*(v+2) )
Different covariance matrices option:
    Minutes = (1.0e-6)*( m*(2.865*n*v + 4.961*m) + 4.25*t*v**2
                         + 67*n*t + 0.62*n*i*t*(v+2) )
where,
t = number of types +1
v = number of variables
i = number of iterations (usually 40-80 )
m = number of kmeans
n = sample size
Time for the 33MHz 486 DX is estimated as 1/25 of the above.
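For convenience, the two timing formulas can be transcribed into a small Python sketch. The function names are hypothetical; the coefficients and the 1/25 scaling are copied from the text above, and the results are only the rough estimates those formulas give.

```python
def ibm360_minutes(n_types, v, i, m, n, common_covariance=True):
    """Estimated IBM 360/65 CPU minutes from the formulas above.

    n_types: number of types, v: number of variables,
    i: number of iterations, m: number of kmeans, n: sample size.
    """
    t = n_types + 1  # the formulas define t as number of types + 1
    kmeans_term = m * (2.865 * n * v + 4.961 * m)
    if common_covariance:
        total = (1667 * (2 * t - 7) + kmeans_term
                 + 4.25 * t * v**2 + 2.12 * t * v**3 + 67 * n * t
                 + 1.23 * n * i * t * (v + 2))
    else:
        total = (kmeans_term + 4.25 * t * v**2
                 + 67 * n * t + 0.62 * n * i * t * (v + 2))
    return 1.0e-6 * total

def pc486dx33_minutes(n_types, v, i, m, n, common_covariance=True):
    """Estimate for a 33 MHz 486 DX: 1/25 of the IBM 360/65 figure."""
    return ibm360_minutes(n_types, v, i, m, n, common_covariance) / 25.0

# Example: 3 types, 4 variables, 40 iterations, 150 kmeans, 150 cases.
estimate = pc486dx33_minutes(3, 4, 40, 150, 150, common_covariance=True)
```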
6. Files=
Logical I/O unit numbers are assigned to files as follows:
15  Input-Form statements (filename specified interactively,
    default: "thisjob")
11  Input data (filename specified on Input-Form; if blank,
    data are read from unit 15 following the format
    statement on the Input-Form)
12  Printout of the analysis (filename specified on Input-Form,
    default: "prinout")
 3  "kc3temp"  Scratch unit containing factor scores
 9  "kv9temp"  Scratch unit containing raw data
 4  "discrim"  the discriminant scores
 7  "dumpout"  the parameter estimates if iteration limit is
               reached. These can be used to continue the
               analysis by including them as initial estimates
               on a new Input-Form.
7. Input
Input to the program comes from three sources:
a. Keyboard: one line containing the file name of the Input-Form.
b. Input-Form file.
c. Data file (filename as specified on the Input-Form)
The program prompts the user for the name of the file containing the
input-form. However, by using the batch program normix.bat, one can
put the input-form file directly on the command line as the first
argument. One can also redirect the console output to a script file
by using an optional second argument. For example, the command
    normix svib script
would read the input-form from the file "svib.inp" and redirect the
console output to the file named "script". For a further example, try
    examples
which runs the batch file "examples.bat" to run all of the sample
problems supplied with this package.
***Directions for Filling Out the Input-Form****************
The Input-Form is a set of control statements by means of
which the user specifies the dimensions of the data and
the options he chooses. The alphabetic contents of the
Input-Form are ignored by the program, but provide a
useful guide to the numerical contents. The file "form.inp"
accompanying this documentation is a blank template with the
alphabetic contents of the Input-Form printed in. The
user should copy the template file onto another file and edit
it, filling in the appropriate numerical values and other
information. Please note that this form does NOT consist of
key words followed by free-form input, as in SAS or SPSS. The
parameters must be entered in the exact columns specified.
In the Input-Form layout below, fill in values where zeros are shown.
************************************************************************
USER=**** DATE USED=24/07/90 01
TITLE=*** 02
COMMENTS= 03
COMMENTS= 04
NUMBER OF VARIABLES=00 SAMPLE SIZE=0000 05
HYPOTHESES FOR NO. OF TYPES=00,00,00,00,00,00,00,00,00,00,00,00,00, 06
DIFFERENT COVARIANCE MATRIX IN EACH TYPE=0 MINIMUM CLUSTER SIZE=000 07
CONTINUE WITH MORE TYPES IF PROBABILITY OF NULL HYPOTHESIS IS BELOW.000 08
NO. OF INITIAL KMEANS GENERATED=000 MAXIMUM HIERARCHY PRINTED=000 09
MAXIMUM ITERATIONS=000 PRINT ITERATION=0 10
DATA INPUT FILE NAME= 11
PRINTOUT FILE NAME= 12
DATA FORMAT= ( ) 13
INITIAL ESTIMATES, TYPES=00 MEANS ARE READ=0 STD.DEVS=0 CORRELATIONS=0 14
************************************************************************
********Additional Remarks on the Input-Form****************
The single-digit zeros in lines 07 and 10 are logical
variables, i.e., 1 means yes and 0 means no.
Standard default options are invoked when lines 06 through
12 are left blank or zero, but this usage is not recommended.
Lines 1-4 will be printed at the top of each page, and
the user should fill them with descriptive comments
concerning his particular problem.
In line 5, col 21-22, enter the number of numeric variables
to be analyzed, not including case IDs, if any.
In line 6, enter the hypothesized numbers of types in
ascending order. If all entries are 00, then 01 ... 20
will be used.
In line 7, col. 42, the computer will assume equal
covariances unless 1 is entered. If different covariance
matrices are to be estimated for each type, then
col. 66-68, line 7 specifies the minimum number of points
a cluster must have for a covariance matrix to be
estimated for it. If its size falls below this minimum,
its covariance matrix will be re-initialized to the
average within-group matrix.
In line 8, the program will proceed to a greater number
of hypothesized types if the (pseudo-) chi-square test is
significant for the likelihood ratio for the current
hypothesis/preceding hypothesis. If .999 is entered, the
program will ignore this cutoff and continue with the next
hypothesis. WARNING: This significance test is known to be
incorrect.
Col. 34-36 in line 9 is the parameter N in subroutine
Kmean and is the size of the sub-sample used to
generate initial estimates. The default value
is the sample size or 2000, whichever is smaller. If the initial
estimates are input by the user, the hierarchical grouping may
be shortened by entering a positive value greater than or equal
to 10. (A positive value smaller than 10 will cause the program
to hang.) A very useful way to generate a variety of initial
estimates is to vary this parameter.
Col. 65-67 in line 9 specifies the number of hierarchical
initial clusters displayed on the .XXXXX skyline diagram.
The default is 30. Entering 30, or any other number, will
produce additional printouts of the first two iterations of
the skyline diagrams, and also a printout of the cluster means
at each stage of the grouping.
If line 10, col.20-22 is left blank, the program will
iterate until convergence is obtained. This generally
takes 50-100 iterations. If 1 is entered in col. 42,
the results of each iteration will be printed.
Line 11, columns 22-62 is the file name for input. In MS-DOS
and Unix, this is the actual file name as it appears in the
directory listing. [On the IBM mainframe CMS, this is
a 7-character internal file name which must be related to
a corresponding external file name by a CMS FILEDEF statement
preceding invocation of the program.]
When the filename on this line is left blank, the data input
defaults to the same file as the input form. In this case, the
data immediately follow the format statement on line 13 and
precede line 14 specifying the initial estimates to be read.
Line 12, columns 20-60 is the file name for printout. In MS-DOS
and Unix, this is the actual file name as it will appear in the
directory listing. [On the IBM mainframe CMS, this is
a 7-character internal file name which must be related to
a corresponding external file name by a CMS FILEDEF statement
preceding invocation of the program.]
When the filename is left blank on this line, the printout
defaults to the filename "prinout".
Line 13 contains a variable format for reading the data. An
option allows reading case IDs of up to 24 alphanumeric
characters, using a format entry such as A24. A9 would read
a nine-character case ID. If a case ID is to be read, it must
be the first variable read. For example (t71,a9,t1,3f5.2/2f5.2)
would read a nine-character case ID from columns 71-79, then
would read three numeric variables from columns 1-15, then two
more numeric variables from the next record in columns 1-10.
If no case IDs are to be read, the A format is omitted.
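To make the column layout of the example format concrete, here is a Python sketch that reads one case laid out as (t71,a9,t1,3f5.2/2f5.2). It is an illustration only, not the program's Fortran reader: the function names and sample records are hypothetical, a blank numeric field is assumed to read as zero, and fields with embedded blanks or exponents are not handled.

```python
def read_f(field, decimals):
    """Interpret one F-format field, e.g. F5.2 means decimals=2.

    A decimal point in the field is used as written; otherwise the
    last `decimals` digits are taken as the fraction.
    """
    text = field.strip()
    if not text:
        return 0.0  # assumption: blank numeric field reads as zero
    if "." in text:
        return float(text)
    return int(text) / 10 ** decimals

def read_case(record1, record2):
    """Read one case laid out as (t71,a9,t1,3f5.2/2f5.2)."""
    record1 = record1.ljust(79)  # pad short records with blanks
    record2 = record2.ljust(10)
    case_id = record1[70:79]                                      # t71,a9: columns 71-79
    first = [read_f(record1[k:k + 5], 2) for k in range(0, 15, 5)]   # t1,3f5.2: columns 1-15
    second = [read_f(record2[k:k + 5], 2) for k in range(0, 10, 5)]  # /2f5.2: next record, columns 1-10
    return case_id, first + second

# Hypothetical two-record case; "  350" has no decimal point and reads as 3.50.
rec1 = "  5.1  350 1.40".ljust(70) + "IRIS00001"
rec2 = "  0.2  100"
case_id, values = read_case(rec1, rec2)
```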
***IMPORTANT*** Several runs should be done with a variety of initial
estimates input starting with line 14. Output from
other clustering programs should be used, if available.
Also, one should do several runs with different values
for the number of KMEANS (line 9, cols 34-36).
Line 14 follows the data and precedes each set of
initial estimates supplied by the user. In col. 26-27,
enter the hypothesized number of types in the set of
initial estimates.
In columns 44, 55, and 70, enter 1 if initial means are
to be read, if initial standard deviations are to be
read, and if correlations are to be read, respectively.
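Because the Input-Form is column-sensitive, a small helper that overwrites exact columns can reduce editing mistakes. The Python sketch below is one possible approach, not part of the package: the helper name `put` is hypothetical, the column positions are those given in the remarks above, and zero-filling is an assumption that simply mirrors the template's appearance.

```python
def put(line, first_col, last_col, value):
    """Return `line` with `value` zero-filled into columns
    first_col..last_col (1-based, inclusive)."""
    width = last_col - first_col + 1
    text = str(value).zfill(width)
    if len(text) != width:
        raise ValueError(f"{value!r} does not fit in columns {first_col}-{last_col}")
    padded = line.ljust(last_col)  # extend short lines with blanks
    return padded[:first_col - 1] + text + padded[last_col:]

# Line 10: maximum iterations in col. 20-22.
line10 = put("MAXIMUM ITERATIONS=000", 20, 22, 50)

# Line 14: number of types in col. 26-27; flags in col. 44, 55, and 70.
line14 = "INITIAL ESTIMATES, TYPES=00 MEANS ARE READ=0 STD.DEVS=0 CORRELATIONS=0"
line14 = put(line14, 26, 27, 3)
for col in (44, 55, 70):
    line14 = put(line14, col, col, 1)
```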
*****************Setup for Initial Estimate *********************
All values are entered in free format in the following order:
Line 1a    Proportion of population in type 1
Line 1b    Means of variables in type 1
Line 2a    Proportion of population in type 2
Line 2b    Means of variables in type 2
...        ...
...        ...
Line ra    Proportion of population in last type
Line rb    Means of variables in last type
Line c     Standard Deviations within type
Line d-1   First row of correlations within type
Line d-2   Second row of correlations within type
...        ...
Line d-m   Last row of correlations within type
If the option for different covariance matrices is selected in
line 7 of the input form, then a set of standard deviations and
correlation Lines (c-d) follows the means Line for each type.
The complete correlation matrix must be read in, but only the
upper half is used by the program. Therefore zeros (or any other
values) may be used as place-holders in the lower half of the matrix.
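When the initial estimates come from another clustering program, the lines can be generated rather than typed. The Python sketch below writes them in the order listed above for the common covariance case; the function name, number formatting, and sample values are hypothetical.

```python
def initial_estimate_lines(props, means, sds, corr):
    """Build free-format initial-estimate lines for the common
    covariance case: proportion and means for each type, then one
    line of standard deviations and the rows of the correlation matrix."""
    def fmt(values):
        return " ".join(f"{v:g}" for v in values)

    lines = []
    for p, mu in zip(props, means):
        lines.append(fmt([p]))   # Line a: proportion of population in this type
        lines.append(fmt(mu))    # Line b: means of variables in this type
    lines.append(fmt(sds))       # Line c: standard deviations within type
    for row in corr:             # Lines d-1 .. d-m: full correlation matrix
        lines.append(fmt(row))
    return lines

# Hypothetical two-type, two-variable example. The lower half of the
# correlation matrix holds a zero as a place-holder.
lines = initial_estimate_lines(
    props=[0.6, 0.4],
    means=[[1.0, 2.0], [4.0, 5.0]],
    sds=[1.5, 2.0],
    corr=[[1.0, 0.3], [0.0, 1.0]],
)
```

For the different covariance option, the standard deviation and correlation lines would instead follow the means line of each type, as described above.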
8. Output
--Hierarchical Grouping---
This routine prints three iterations as follows:
1) Mahalanobis distances from total covariance matrix.
2) Mahalanobis distances from within-group covariances
of 10 groups found in step 1.
3) Mahalanobis distances from within-group covariances
of 10 groups found in step 2.
Rank=  line number of the printout
Item=  individual or object being clustered
Kmean= the kmean cluster membership of the item;
       the first item of each kmean begins with a - sign.
Stage= the number of groups remaining after this cluster
       was merged with the preceding cluster.
The right hand side of the page is read columnwise. Each
column represents a stage in the hierarchical grouping.
Each group begins with a period and continues with an X.
--Iteration 0 ---
This printout gives the initial estimates used to begin the
iterative solution to the likelihood equations. These
estimates will be the same as the corresponding
hierarchical grouping unless the user includes his own
initial estimates.
--Iteration-last--
This printout gives the converged solution to the likelihood equations.
--Probabilities of type membership--(self-explanatory)
--Discriminant Functions---
Gives the weights to apply to the raw scores so that
discriminant scores will have maximal discrimination between
groups and the identity matrix for the within-group
covariances.
--Cluster Members---
This is a list of case sequence numbers and IDs, sorted by cluster.
--Printer Plot---
Each point has a number or letter designating the type for
which this individual has the highest probability of
membership. The first 9 clusters are identified by the
digits 1-9. The next 11 clusters are identified by the
letters A-K. An * indicates a place where there are two
individuals with different cluster membership. Points which
lie beyond the boundaries of the graph are projected and
plotted at the boundaries.
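The symbol scheme described above can be written as a short Python sketch. The function names are hypothetical; the mapping itself (1-9, then A-K, with * for mixed cells) is taken from the description.

```python
def plot_symbol(cluster):
    """Plot symbol for a cluster number from 1 to 20."""
    if not 1 <= cluster <= 20:
        raise ValueError("cluster number must be between 1 and 20")
    if cluster <= 9:
        return str(cluster)                # clusters 1-9: digits
    return chr(ord("A") + cluster - 10)    # clusters 10-20: letters A-K

def cell_symbol(clusters_in_cell):
    """Symbol for one plot position holding one or more individuals."""
    distinct = set(clusters_in_cell)
    if len(distinct) > 1:
        return "*"  # individuals with different cluster membership
    return plot_symbol(distinct.pop())
```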
--Summary of Likelihood Statistics
At the end of the printout for the last hypothesized number of types
is a summary page giving the log likelihood for each hypothesis.
[After doing 5-6 runs with different initial estimates, I like to
copy this column into a spreadsheet; one column for each initial
estimate, then create a new column that is the maximum across all
initial estimates. Create another column that is the difference
between the current maximum and the maximum for the line above.
(The pseudo chi-square is proportional to this.) I make my final
decision as to how many clusters there are by plotting these values
and taking the one that is just above the line where the values
level off.]
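The spreadsheet procedure in the bracketed note can also be sketched in Python. The function name and the log likelihood values below are made up for illustration; the final choice of the number of clusters is left to the plot and the analyst's judgment, as in the note.

```python
def summarize_runs(runs):
    """runs maps a run label to its log likelihoods, one per hypothesis,
    listed in the same ascending order of number of types for every run.

    Returns (best, gains): the maximum across runs for each hypothesis,
    and the difference between each maximum and the one on the line above.
    """
    best = [max(values) for values in zip(*runs.values())]
    gains = [None] + [b - a for a, b in zip(best, best[1:])]
    return best, gains

# Hypothetical log likelihoods for hypotheses of 1, 2, 3, and 4 types.
runs = {
    "run A": [-310.0, -250.0, -241.0, -240.0],
    "run B": [-310.0, -255.0, -238.0, -237.5],
}
best, gains = summarize_runs(runs)
```

Plotting `gains` against the number of types shows where the improvement levels off.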
--Notes on Printing and Editing
The printout assumes a page of 60 lines by 120 characters wide.
To allow for page numbers and headers, one should use a listing
program that prints 66 lines by 133 characters. I like to use
Norton Utilities lp /133 listing program.
Normix produces voluminous output, which most users will
want to edit down. Many editors cannot handle long text files.
Norton Desktop for Windows contains a desktop editor that works
for long Normix printouts. There are undoubtedly other editors
that could be used.
.......................................................................
D. Trouble Reports:
Results are not guaranteed, and the author assumes no
liability for any problems the user may experience in using
this program. However, I would be greatly interested in hearing
of people's experiences with it. Please report any questions,
problems, or difficulties to
John H. Wolfe            Internet: wolfe@acm.org
4310 Hill Street         Telephone: (619) 222-5860
San Diego, Ca. 92107.
If this address doesn't work, try the Membership directories for
the American Statistical Association or the Classification Society.